2014-12-04 14:14:21
made by Vegard Nygaard,
Department of Tumor Biology
Oslo University Hospital.
Setting some dependencies
library(RColorBrewer)
library(xtable)
library(sva)
set.seed(100)
source("./helper_functions.r")
source("read_input.r")
## Warning: package 'RSQLite' was built under R version 3.1.2
## FILTERING PROBES BY FLAGS
##
##
## FILTERING BY ControlType
##
## FEATURES BEFORE FILTERING: 961
## FEATURES AFTER ControlType FILTERING: 939
## ------------------------------------------------------
## FILTERING BY IsGeneDetected FLAG
##
## FLAG FILTERING OPTIONS - FLAG OK = 1 - limIsGeneDetected: 75 %
## FEATURES AFTER IsGeneDetected FILTERING: 288
## NON Gene Detected : 651
## ------------------------------------------------------
The data matrix has 1573 samples and 266 microRNAs.
print(xtable(table(sampleannotation[, c("provider", "tissue_type")]),
caption="", digits=3),
comment = TRUE,
type = "html",
html.table.attributes="CELLPADDING=5",
include.rownames = TRUE)
| benign | DCIS | invasive | normal | |
|---|---|---|---|---|
| AHUS | 23 | 8 | 55 | 70 |
| UCAM | 8 | 10 | 1283 | 116 |
The tissue is not evenly distributed among the providers. “benign” is not part of Volinia et al.s results, but it is kept for now since it is useful for assessing batch differences.
The data is already combined using the common microRNAs, but QC plots will reveal that batch effects exists.
colorpal = c(brewer.pal(9,"Set1"), brewer.pal(8,"Set2"))
somesamples=vector()
for(p in unique(sampleannotation$provider))
{
for(t in unique(sampleannotation$tissue_type))
{
a = sampleannotation$tissue_type==t & sampleannotation$provider==p
s = sum(a)
if(s>=10)
s=10
somesamples = c(somesamples,sample(sampleannotation$sample_id[a], s) )
}
}
boxplot(log2(common_matrix[,somesamples ]), names=NA,
col=colorpal[as.factor(sampleannotation[somesamples, "tissue_type"])])
legend("topleft", legend=unique(sampleannotation[somesamples, "tissue_type"]),
text.col=colorpal[ as.factor(unique(sampleannotation[somesamples, "tissue_type"]))])
The box-plot of a few random samples shows a difference in signal distribution between AHUS and UCAM, but the tissue types seems to have a similar distribution within each provider.
hist( log2(common_matrix[,sampleannotation$provider=="AHUS"]),
breaks=100, freq=FALSE, border="red",
main="Raw intesity distributions", xlab="log2 signal" )
hist( log2(common_matrix[,sampleannotation$provider=="UCAM"]), breaks=100, freq=FALSE, border="blue", add=T )
legend("topright", legend=c("AHUS", "UCAM"), text.col=c("red", "blue"))
The density plot also reveals the difference and the data from the different providers seems to have been processed differently.
#common_matrix_qnorm = normalizeBetweenArrays(common_matrix, method="quantile")
#common_matrix_qnorm = common_matrix
# x = rowMeans(common_matrix[, sampleannotation$provider=="AHUS"])
# y = unlist(apply(common_matrix[, sampleannotation$provider=="AHUS"], MARGIN=1, FUN=sd))
# plot( smooth(y[order(x)]), col="red", type="l")
#
# x = rowMeans(common_matrix[, sampleannotation$provider=="UCAM"])
# y = unlist(apply(common_matrix[, sampleannotation$provider=="UCAM"], MARGIN=1, FUN=sd))
# lines( smooth(y[order(x)]), col="blue")
#
#
# for (i in 3:48)
# {
# qqnorm (data[,i]); qqline(data[,i], col = 2)
# }
# qqnorm (common_matrix[,3]); common_matrix(data[,3], col = 2)
# hist(data[,13])
#
# index = index+1
# a = sampleannotation$provider=="AHUS" & sampleannotation$tissue_type=="invasive"
# qqnorm (common_matrix[index, a])
# qqline(common_matrix[index, a], col = 2)
Next we will see how the samples cluster.
plotmanypca(list( filtered=common_matrix),sampleannotation, "" , views=c("tissue_provider"))
The PCA plot shows that the samples are segregated on provider rather than tissue. Event thought both data sets use Agilent arrays with a lot of the same probes, a clear batch effect is observed. Our tissues and other groupings are not evenly balanced across the data sets, thus ignoring the batch effect would probably lead to false results. Accounting for the batch effect is best done by including the batch information in the statistical tests. This is possible in Limma where batch can be used as a blocking factor much the same way as a two-way ANOVA. Another approach is to analyse the data sets separately and combine the results with a meta-analysis tool.
However, in an attempt to plot and assess the grouping of the samples, the batch effect can partially be adjusted for using the “ComBat” method from the sva-package. This adjusted matrix will only be useful for plotting, since the confidence for the group differences will be overrated.
mod = as.factor(sampleannotation$tissue_type)
common_matrix_batchnorm = ComBat(dat=normalizeBetweenArrays(common_matrix, method="quantile"),
batch=as.factor(sampleannotation$provider),
mod=mod, par.prior=TRUE, prior.plots=FALSE)
## Found 2 batches
## Found 1 categorical covariate(s)
## Standardizing Data across genes
## Fitting L/S model and finding priors
## Finding parametric adjustments
## Adjusting the Data
After batch adjustment in ComBat the PCA-plot looks like this;
plotmanypca(list(batchnorm=common_matrix_batchnorm),sampleannotation, "" , views=c("tissue_type"))
But be aware that we instructed the batch adjustment to preserve what looks like group differences (but might not be). So the group similarities are unreliable. These plots are useful for checking the groupings of other sample annotations like IHC and pam50 classification.
Pam50 classification has been done on the mRNA samples. Many of the microRNA samples also have a mRNA sample and thus a pam50 classification. We have too few samples for looking at the DCIS to invasive change for each of the pam50 classes. Therefore we will treat all DCIS as “DCIS”.
sampleannotation$pam50_est[sampleannotation$tissue_type=="DCIS"]="DCIS"
table(sampleannotation[,c("pam50_est", "tissue_type")], useNA="ifany")
## tissue_type
## pam50_est benign DCIS invasive normal
## Basallike 2 0 198 0
## DCIS 0 18 0 0
## Her2 0 0 167 0
## LumA 3 0 476 9
## LumB 0 0 385 1
## Normallike 12 0 97 105
## unknown 14 0 15 71
PCA-plot with the pam50 labels. DCIS and invasive samples only.
plotmanypca(list( stratified=common_matrix_batchnorm[,sampleannotation$tissue_type %in% c("DCIS", "invasive")]),
sampleannotation[sampleannotation$tissue_type %in% c("DCIS", "invasive"),], "" , views=c("pam50_est"))
This looks like a separation.
Many of the samples are also diagnosed based on immunohistochemistry of HER2, PGR and ER proteins. Four diagnoses are used, “HER2neg/ERneg/PGRneg”, “HER2neg/ERpos”, “HER2pos/ERneg”, “HER2pos/ERpos”. We would like too look for microRNA that separates samples based on that grouping as well.
sampleannotation$IHC[sampleannotation$IHC==""] = "Unknown"
table(sampleannotation[,c("tissue_type", "IHC")], useNA="ifany")
## IHC
## tissue_type HER2neg_ERneg_PGRneg HER2neg_ERpos HER2pos_ERneg HER2pos_ERpos
## benign 5 2 0 0
## DCIS 2 3 5 0
## invasive 199 940 87 93
## normal 0 0 0 0
## IHC
## tissue_type Unknown
## benign 24
## DCIS 8
## invasive 19
## normal 186
PCA-plot with the IHC labels. DCIS and invasive samples only.
plotmanypca(list( stratified=common_matrix_batchnorm[,sampleannotation$tissue_type %in% c("DCIS", "invasive")]),
sampleannotation[sampleannotation$tissue_type %in% c("DCIS", "invasive"),], "" , views=c("IHC") )
The IHC labels do not seem to cluster.
Breast cancer signatures for invasiveness and prognosis defined by deep sequencing of microRNA.
Volinia S, Galasso M, Sana ME, Wise TF, Palatini J, Huebner K, Croce CM.
Proc Natl Acad Sci U S A. 2012 Feb 21;109(8):3024-9. doi: 10.1073/pnas.1200010109. Epub 2012 Feb 6.
The shaping and functional consequences of the microRNA landscape in breast cancer. Dvinge H, Git A, Gräf S, Salmon-Divon M, Curtis C, Sottoriva A, Zhao Y, Hirst M, Armisen J, Miska EA, Chin SF, Provenzano E, Turashvili G, Green A, Ellis I, Aparicio S, Caldas C. Nature. 2013 May 16;497(7449):378-82. doi: 10.1038/nature12108. Epub 2013 May 5.
R Core Team (2013). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/
Yihui Xie (2013). knitr: A general-purpose package for dynamic report generation in R. R package version 1.5.
Yihui Xie (2013) Dynamic Documents with R and knitr. Chapman and Hall/CRC. ISBN 978-1482203530
Yihui Xie (2013) knitr: A Comprehensive Tool for Reproducible Research in R. In Victoria Stodden, Friedrich Leisch and Roger D. Peng, editors, Implementing Reproducible Computational Research. Chapman and Hall/CRC. ISBN 978-1466561595
RStudio Team (2012). RStudio: Integrated Development for R. RStudio, Inc., Boston, MA URL http://www.rstudio.com/.
sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] AgiMicroRna_2.14.0 affycoretools_1.36.1 GO.db_2.14.0
[4] RSQLite_1.0.0 DBI_0.3.1 AnnotationDbi_1.26.1
[7] GenomeInfoDb_1.0.2 preprocessCore_1.26.1 affy_1.42.3
[10] limma_3.20.9 Biobase_2.24.0 BiocGenerics_0.10.0
[13] sva_3.10.0 mgcv_1.8-3 nlme_3.1-118
[16] corpcor_1.6.7 xtable_1.7-4 RColorBrewer_1.0-5
loaded via a namespace (and not attached):
[1] acepack_1.3-3.3 affyio_1.32.0
[3] annaffy_1.36.0 annotate_1.42.1
[5] AnnotationForge_1.6.1 base64enc_0.1-2
[7] BatchJobs_1.4 BBmisc_1.7
[9] BiocInstaller_1.14.3 BiocParallel_0.6.1
[11] biomaRt_2.20.0 Biostrings_2.32.1
[13] biovizBase_1.12.3 bit_1.1-12
[15] bitops_1.0-6 brew_1.0-6
[17] BSgenome_1.32.0 Category_2.30.0
[19] caTools_1.17.1 checkmate_1.5.0
[21] cluster_1.15.3 codetools_0.2-9
[23] colorspace_1.2-4 DESeq2_1.4.5
[25] dichromat_2.0-0 digest_0.6.4
[27] edgeR_3.6.8 evaluate_0.5.5
[29] fail_1.2 ff_2.2-13
[31] foreach_1.4.2 foreign_0.8-61
[33] formatR_1.0 Formula_1.1-2
[35] gcrma_2.36.0 gdata_2.13.3
[37] genefilter_1.46.1 geneplotter_1.42.0
[39] GenomicAlignments_1.0.6 GenomicFeatures_1.16.3
[41] GenomicRanges_1.16.4 ggbio_1.12.10
[43] ggplot2_1.0.0 GOstats_2.30.0
[45] gplots_2.14.2 graph_1.42.0
[47] grid_3.1.1 gridExtra_0.9.1
[49] GSEABase_1.26.0 gtable_0.1.2
[51] gtools_3.4.1 Hmisc_3.14-5
[53] htmltools_0.2.6 hwriter_1.3.2
[55] IRanges_1.22.10 iterators_1.0.7
[57] KernSmooth_2.23-13 knitr_1.8
[59] lattice_0.20-29 latticeExtra_0.6-26
[61] locfit_1.5-9.1 MASS_7.3-35
[63] Matrix_1.1-4 munsell_0.4.2
[65] nnet_7.3-8 oligoClasses_1.26.0
[67] PFAM.db_2.14.0 plyr_1.8.1
[69] proto_0.3-10 R.methodsS3_1.6.1
[71] R.oo_1.18.0 R.utils_1.34.0
[73] R2HTML_2.3.1 RBGL_1.40.1
[75] Rcpp_0.11.3 RcppArmadillo_0.4.450.1.0
[77] RCurl_1.95-4.3 ReportingTools_2.4.0
[79] reshape2_1.4 rmarkdown_0.3.3
[81] rpart_4.1-8 Rsamtools_1.16.1
[83] rtracklayer_1.24.2 scales_0.2.4
[85] sendmailR_1.2-1 splines_3.1.1
[87] stats4_3.1.1 stringr_0.6.2
[89] survival_2.37-7 tools_3.1.1
[91] VariantAnnotation_1.10.5 XML_3.98-1.1
[93] XVector_0.4.0 zlibbioc_1.10.0
generation ended 2014-12-04 14:14:47. Time spent 0 minutes .